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Abstract 

This paper presents a partial solution to a component 
of the problem of lexical choice: choosing the syn- 
onym most typical, or expected, in context. We apply 
a new statistical approach to representing the context of 
a word through lexical co-occurrence networks. The im- 
plementation was trained and evaluated on a large cor- 
pus, and results show that the inclusion of second-order 
co-occurrence relations improves the performance of our 
implemented lexical choice program. 

1 Introduction 

Recent work views lexical choice as the process of map- 
ping from a set of concepts (in s ome representa tion of 
kno wledge) to a word or phrase ([Elhadad, 1992|; Stede, 
1996). When the same concept admits more than one 
lexicalization, it is often difficult to choose which of 
these 'synonyms' is the most appropriate for achieving 
the desired pragmatic goals; but this is necessary for 
high-quality machine translation and natural language 
generation. 

Knowledge-based approaches to representing the po- 
tentially subtle differences between synonyms have suf- 
fered from a serious lexical ac quisition bottleneck (D i- 
Marco, Hirst, and Stede, 1993; |Hirst, 1995| ). Statisti- 
cal approaches, which have sought to explicitly repre- 
sent differences between pairs of synonyms with respect 
to their occurrence with other specific words (Church et 
al., 1994), are inefficient in time and space. 

This paper presents a new statistical approach to mod- 
eling context that provides a preliminary solution to an 
important sub-problem, that of determining the near- 
synonym that is most typical, or expected, if any, in a 
given context. Although weaker than full lexical choice, 
because it doesn't choose the 'best' word, we believe that 
it is a necessary first step, because it would allow one to 
determine the effects of choosing a non-typical word in 
place of the typical word. The approach relies on a gener- 
alization of lexical co-occurrence that allows for an im- 
plicit representation of the differences between two (or 
more) words with respect to any actual context. 

For example, our implemented lexical choice program 
selects mistake as most typical for the 'gap' in sen- 
tence 1), and error in (2). 



(1) However, such a move also would run the risk 
of cutting deeply into U.S. economic growth, 
which is why some economists think it would 
be a big {error | mistake | oversight}. 

(2) The {error | mistake | oversight} was magnified 
when the Army failed to charge the standard 
percentage rate for packing and handling. 

2 Generalizing Lexical Co-occurrence 
2.1 Evidence-based Models of Context 

Evidence-based models represent context as a set of fea- 
tures, say words, that are observed to co-o ccur with, and 
thereby predi ct, a word ([Yarowsky, 1992|; Golding and 
Scha bes, 1996;|Karow and Edelman, 1996fc N g and Lee, 
1996). But, if we use just the context surrounding a word, 
we might not be able to build up a representation satis- 
factory to uncover the subtle differences between syn- 
onyms, because of the massive volume of text that would 
be required. 

Now, observe that even though a word might not co- 
occur significantly with another given word, it might 
nevertheless predict the use of that word if the two words 
are mutually related to a third word. That is, we can 
treat lexical co-occurrence as though it were moderately 
transitive. For example, in (3), learn provides evidence 
for task because it co-occurs (in other contexts) with dif- 
ficult, which in turn co-occurs with task (in other con- 
texts), even though learn is not seen to co-occur signifi- 
cantly with task. 

(3) The team's most urgent task was to learn 
whether Chernobyl would suggest any safety 
flaws at KWU-designed plants. 

So, by augmenting the contextual representation of a 
word with such second-order (and higher) co-occurrence 
relations, we stand to have greater predictive power, as- 
suming that we assign less weight to them in accordance 
with their lower information content. And as our results 
will show, this generalization of co-occurrence is neces- 
sary. 

We can represent these relations in a lexical co- 
occurrence network, as in figure fil that connects lexi- 
cal items by just their first-order co-occurrence relations. 




Figure 1: A fragment of the lexical co-occurrence net- 
work for task. The dashed line is a second-order relation 
implied by the network. 



Second-order and higher relations are then implied by 
transitivity. 

2.2 Building Co-occurrence Networks 

We build a lexical co-occurrence network as follows: 
Given a root word, connect it to all the words that sig- 
nificantly co-occur with it in the training corpus;[] then, 
recursively connect these words to their significant co- 
occurring words up to some specified depth. 

We use the intersection of two well-known measures 
of significance, mutu al information scores and f-scores 
dChurch et al., 1994 ), to determine if a (first-order) co- 
occurrence relation should be included in the network; 
however, we use just the f-scores in computing signif- 
icance scores for all the relations. Given two words, 
wo and Wd, in a co-occurrence relation of order d, and 
a shortest path P(wq, Wd) = (wo, ■ ■ ■ ,Wd) between them, 
the significance score is 



sig(w ,W d ) = £ 



t(Wi-l,W{) 



This formula ensures that significance is inversely pro- 
portional to the order of the relation. For example, in the 
network of figure |IJ sig(task, learn) — [t (task, difficult) + 
\t (difficult, learn)] /8 = 0.41. 

A single network can be quite large. For instance, the 
complete network for task (see figure [j]) up to the third- 
order has 8998 nodes and 37,548 edges. 



2.3 Choosing the Most Typical Word 

The amount of evidence that a given sentence provides 
for choosing a candidate word is the sum of the signifi- 
cance scores of each co-occurrence of the candidate with 
a word in the sentence. So, given a gap in a sentence S, 



Our training corpus was the part-of-speech-tagged 1989 Wall 
Street Journal, which consists of N = 2,709,659 tokens. No lemmati- 
zation or sense disambiguation was done. Stop words were numbers, 
symbols, proper nouns, and any token with a raw frequency greater 
than F = 800. 



Set 


POS 


Synonyms (with training corpus frequency) 


1 


JJ 


difficult (352), hard (348), tough (230) 


2 


NN 


error (64), mistake (61), oversight (37) 


3 


NN 


job (418), task (123), duty (48) 


4 


NN 


responsibility (142), commitment (122), 






obligation (96), burden (81) 


5 


NN 


material (177), stuff (79), substance (45) 


6 


VB 


give (624), provide (501), offer (302) 


7 


VB 


settle (126), resolve (79) 



Table 1 : The sets of synonyms for our experiment. 



we find the candidate c for the gap that maximizes 



M(c,S)= £sig(c,w) 

weS 



For example, given S as sentence (3), above, and the net- 
work of figure |l[ M(task, S) — 4.40. However, job (using 
its own network) matches best with a score of 5.52; duty 
places third with a score of 2.21. 

3 Results and Evaluation 

To evaluate the lexical choice program, we selected sev- 
eral sets of near-synonyms, shown in table [l], that have 
low polysemy in the corpus, and that occur with similar 
frequencies. This is to reduce the confounding effects of 
lexical ambiguity. 

For each set, we collected all sentences from the yet- 
unseen 1987 Wall Street Journal (part-of-speech-tagged) 
that contained any of the members of the set, ignoring 
word sense. We replaced each occurrence by a 'gap' that 
the program then had to fill. We compared the 'correct- 
ness' of the choices made by our program to the baseline 
of always choosing the most frequent synonym accord- 
ing to the training corpus. 

But what are the 'correct' responses? Ideally, they 
should be chosen by a credible human informant. But re- 
grettably, we are not in a position to undertake a study of 
how humans judge typical usage, so we will turn instead 
to a less ideal source: the authors of the Wall Street Jour- 
nal. The problem is, of course, that authors aren't always 
typical. A particular word might occur in a 'pattern' in 
which another synonym was seen more often, making it 
the typical choice. Thus, we cannot expect perfect accu- 
racy in this evaluation. 

Table || shows the results for all seven sets of syn- 
onyms under different versions of the program. We var- 
ied two parameters: (1) the window size used during the 
construction of the network: either narrow (±4 words), 
medium (± 10 words), or wide (± 50 words); (2) the 
maximum order of co-occurrence relation allowed: 1, 2, 
or 3. 

The results show that at least second-order co- 
occurrences are necessary to achieve better than baseline 
accuracy in this task; regular co-occurrence relations are 
insufficient. This justifies our assumption that we need 



Set 


1 


2 


3 


4 


5 


6 


7 


Size 


6665 


1030 


5402 


3138 


1828 


10204 


1568 


Baseline 


40.1% 


33.5% 


74.2% 


36.6% 


62.8% 


45.7% 


62.2% 


I 

i 




1 8 7% 


34 5% 


97 l°/r> 


98 8% 


33 9% 


/L1 3% 


in arrow z 


4.7 9% 
T- / .z /c 


AA S% 


f\f\ 9% 
OO.Z /C 




01.7/0 


ZL8 1 % 


69 8% a 


3 


47.9% 


48.9% 


68.9% 


44.3% 


64.6%" 


48.6% 


65.9% 


1 


24.0% 


25.0% 


26.4% 


29.3% 


28.8% 


20.6% 


44.2% 


Medium 2 


42.5% 


47.1% 


55.3% 


45.3% 


61.5%" 


44.3% 


63.6% fl 


3 


42.5% 


47.0% 


53.6% 










Wide 1 


9.2% 


20.6% 


17.5% 


20.7% 


21.2% 


4.1% 


26.5% 


2 


39.9% a 


46.2% 


47.1% 


43.2% 


52.7% 


37.7% 


58.6% 



"Difference from baseline not significant. 



Table 2: Accuracy of several different versions of the lexical choice program. The best score for each set is in boldface. 
Size refers to the size of the sample collection. All differences from baseline are significant at the 5% level according 
to Pearson's % 2 test, unless indicated. 



more than the surrounding context to build adequate con- 
textual representations. 

Also, the narrow window gives consistently higher ac- 
curacy than the other sizes. This can be explained, per- 
haps, by the fact that differences between near-synonyms 
often involve differences in short-distance collocations 
with neighboring words, e.g., face the task. 

There are two reasons why the approach doesn't do as 
well as an automatic approach ought to. First, as men- 
tioned above, our method of evaluation is not ideal; it 
may make our results just seem poor. Perhaps our results 
actually show the level of 'typical usage' in the newspa- 
per. 

Second, lexical ambiguity is a major problem, af- 
fecting both evaluation and the construction of the co 
occurrence network. For example, in sentence (3) 
above, it turns out that the program uses safety as evi- 
dence for choosing job (because job safety is a frequent 
collocation), but this is the wrong sense of job. Syntactic 
and collocational red herrings can add noise too. 

4 Conclusion 

We introduced the problem of choosing the most typical 
synonym in context, and gave a solution that relies on a 
generalization of lexical co-occurrence. The results show 
that a narrow window of training context (±4 words) 
works best for this task, and that at least second-order 
co-occurrence relations are necessary. We are planning 
to extend the model to account for more structure in the 
narrow window of context. 
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